Practical 1

Practical 1 aims at analyzing water levels of the longest (i.e. 295km) Swiss river (rises in the Bernese Alps and ends at the junction with the Rhine) at a measuring station in Untersiggenthal next to the Paul Scherrer Institute where nuclear reactors are located, making this location very sensitive in case of flood damage.

Data

The next table displays the first six rows of the niveau data set. The data set collects the Aare’s daily maximum water levels in one unique station in Stilli, Untersiggenthal (canton of Aargau) and records the exact times at which daily maxima are detected.

Stationsname Stationsnummer Parameter Zeitreihe Parametereinheit Gewässer Zeitstempel Zeitpunkt_des_Auftretens Wert Freigabestatus
Untersiggenthal, Stilli 2205 Pegel Tagesmaxima m ü.M. Aare 2000-01-01 00:00:00 2000-01-01 00:23:00 326.245 Freigegeben, validierte Daten
Untersiggenthal, Stilli 2205 Pegel Tagesmaxima m ü.M. Aare 2000-01-02 00:00:00 2000-01-02 00:43:10 326.153 Freigegeben, validierte Daten
Untersiggenthal, Stilli 2205 Pegel Tagesmaxima m ü.M. Aare 2000-01-03 00:00:00 2000-01-03 00:00:00 326.053 Freigegeben, validierte Daten
Untersiggenthal, Stilli 2205 Pegel Tagesmaxima m ü.M. Aare 2000-01-04 00:00:00 2000-01-04 01:43:40 325.871 Freigegeben, validierte Daten
Untersiggenthal, Stilli 2205 Pegel Tagesmaxima m ü.M. Aare 2000-01-05 00:00:00 2000-01-05 21:23:00 325.837 Freigegeben, validierte Daten
Untersiggenthal, Stilli 2205 Pegel Tagesmaxima m ü.M. Aare 2000-01-06 00:00:00 2000-01-06 03:33:05 325.835 Freigegeben, validierte Daten

Time Series

The below plot shows the evolution of daily water maxima over 21 years from 01-01-2000 to 01-08-2021. We see that the lowest troughs are quite constant over the years (~325.5 meter above sea level) whereas the highest peaks really distinguish themselves among the overall peaks. Indeed, we observe that peaks are mainly observed in the Summer months with the highest peaks recorded on 25-08-2005 at 328.827 (m.a.s.l), on 09-06-2007 at 329.323 (m.a.s.l) and on 14-07-2021 at 328.622 (m.a.s.l). Therefore, over 21 years three years have had extremely and unusually high water levels during the summer, respectively 2005, 2007 and 2021.

We also observe a pattern of the maxima better shown by the green smoothed curve which indicates that water levels fluctuate according to a specific logic within a year (e.g. heavier rain in November and April).

Daily Water Level Maxima’s Distribution

The following output shows the five-number statistics summary of the water level values. The summary shows that the water level values are not symmetrical around the mean nor around the median but much more tightly grouped on the left hand side of the mean (and the median) than on its right hand side (i.e. more dispersed), indicating that the water level maxima’s distribution is right-skewed.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   325.4   325.6   325.8   325.9   326.1   329.3

The below histogram (i.e. frequency distribution) confirms the above observations. Indeed, we see that the data is right-skewed (i.e. therefore non-normal) indicating that the data is disproportionately distributed on the right where daily maxima happen to have much higher values than usual (i.e. outliers) that need to be investigated.

The blue curve represents the smoothed version of the histogram which reflects more accurately the probability density function of the values. The dark green vertical line draws the median water level value, 325.8 meter above sea, and the light green vertical line draws the average, 325.9.

Looking at the positions on the x-axis of the mean and the median [and the mode] with regard to the distribution, we see that the mean seems to be a better indicator of the center of the distribution, even though it is not centered around it.

Extreme Value Analysis (EVA)

The objective behind the EVA is to find valid estimates of the frequency (expressed as a return period) of extreme water level maxima. Extreme values are usually infrequent but have a large impact on whatever is the focus on interest (here flood risk) which is why they represents the tail of the distribution.

Peak-Over-Threshold (POT) Method

The Peak-Over-Threshold method provides a model for independent exceedences (i.e. extreme values) above a large threshold. Therefore, it identifies values that are high enough to be above a designated threshold \(\mu\), above which we consider are the extremes (i.e. highest values). In order to determine an optimal threshold, we apply the MRL-plot and then look at the distribution of the exceedences. The value of \(u\) from which the plot becomes approximately linear can generally be selected as the optimal threshold.

Therefore, to model the high water levels, we first proceed to an MRL-plot to choose the optimal threshold \(u\) and then, we use the Peak-over-Threshold method and look at the distribution of the exceedences to assess their probability of occurrence.

Clustering of the extremes

Clusters of the extremes correspond to the clustering of the data points that are above the chosen threshold \(u\).

To measure and deal with clustering of the daily water levels, we consider consecutive threshold exceedences to belong to the same cluster. Therefore, by using the POT approach, we can observe on the Peak-Over-Threshold plot (below) the different clusters of the extreme values. Then, we can fit the Generalized Pareto Distribution (GPD) or Point Process model to the cluster maxima (after declustering if the exceedences exhibit autocorrrelation).

Analysis

Threshold Selection

First, we start by selecting an appropriate threshold \(u\) to use for fitting the GPD or Point Process models. To do so, we use the mrlplot() function from the extRemes package which plots the potential thresholds against the mean excess.

The mean excess function \(e(u) = E(X - u| X > u)\) is the expected value of the excess (i.e. difference between the exceedance and the high threshold (i.e. \(u\))), knowing that the exceedance is greater than the threshold.

We are interested in finding the distribution of the daily water level maxima that are above the threshold \(u\) (i.e. exceedences), which is the excess distribution function \(F_u(x) = P(X - u \leq x | X > u)\).

The difficulty of choosing an adequate threshold \(u\) lies in the fact that it requires a good balance between selecting a threshold low enough so that we have enough exceedences to obtain useful results (i.e. results precision) but high enough so that the GPD and Point Process methods are approximately valid (i.e. bias).

[Notes for ourselves in case of a question: “If we take the threshold to be so low that most of the data are exceedences then the analysis will only be valid if the raw data came from a generalized Pareto distribution (which will rarely be the case) = 80% of consequances (i.e. exceedences) come from 20% of the causes”]

The below Mean Residual Life Plot indicates for each potential threshold \(u\) its mean excess. As mentioned previously, the value of \(u\) from which the plot becomes approximately linear can generally be selected as the optimal threshold. The dashed curves represent the confidence intervals of the mean excesses which help assess its type (i.e. constant, linear). Therefore, we decide to choose a threshold \(u = 327\ [m.a.s.l]\) form which the curve seems to become linear.

From now on, daily water level maxima above \(327\ [m.a.s.l]\) are considered extreme values.

Therefore, below \(327\ [m.a.s.l]\), a threshold would be too low for the assumptions of the extreme value approach to be valid and above \(327\ [m.a.s.l]\) would be too high to obtain insghtful results because of too few data.

Considering that the mean excess function is used to validate a GPD model for the excess distribution, we see that the mean excess function would roughly be a straight line if we could zoom out on the y-axis, we can say that the GPD is a good fit for modelling the exceedences.

Peaks-Over-Threshold Plot

The below plot shows the results of the Peaks-Over-Treshold method. The red line represents the designated appropriate threshold \(u = 327\ [m.a.s.l]\) and the red dots are the extreme daily water maxima recorded and considered as exceedences.

The POT approach states that the observations above the high threshold \(u\) (i.e. red observations above 327 m.a.s.l) are in absolute terms not independent but it states that if they are not consecutive above \(u\), they belong to different clusters and therefore are considered independent only if they are at least a certain number of consecutive observations between them.

To avoid dependence between exceedences, we proceed to decluster the observations above the threshold \(u\). To do so, the declustering keeps only the highest water level maxima of one cluster as the key representative observation of the cluster. The maxima identified within each clusters will then be fit to the GPD.

The following plot shows that some exceedences above \(327 m.a.s.l\) become more transparent and some others remain more visible. The more visible observations represent the 79 observations (one observation per cluster declustered) obtained after the declustering.

## 
##  niveau$Wert  declustered via runs  declustering.
## 
##  Estimated extremal index (intervals estimate) =  0.6758134 
## 
##  Number of clusters =  79 
## 
##  Run length =  1

Generalized Pareto Distribution (GPD)

The GPD is used to describe statistical properties of excesses. Indeed, Extreme Value theory suggests that GPD is a natural approximation of the excess distribution function \(F_u(x)\) where \(x\) is the excess, so \(F_u(x) \approx G_{\xi, \beta}(x)\). [maybe add parameters conditions]

Under the assumption that exceedences are independent (i.e. probability of occurrence of one exceedance does not affect the probability of another), identically distributed (i.e. are random variables) and their distribution is asymptotic (i.e. ?), we fit the GDP model on the daily water level exceedences.

Return Levels (50- and 100-year events)

The below output shows the return level estimates for 50-year and 100-year events.

## extRemes::fevd(x = unlist(decluster), threshold = threshold, 
##     type = "GP", time.units = "365/year")
## get(paste("return.level.fevd.", newcl, sep = ""))(x = x, return.period = return.period)
## 
##  GP model fitted to  unlist(decluster)  
## Data are assumed to be  stationary 
## [1] "Return Levels for period units in years"
##  50-year level 100-year level 
##       329.1207       329.3560

The below plot shows the return level estimates by return period and we see that it matches the values from above.

Block Maxima Approach

  • Advantages:

The Block Maxima method gives more efficient results when less exceedences are observed. This method may also be preferable when the observations are not exactly independent and identically distributed. For example, when there is seasonal periodicity in case of yearly maxima or, when there is short range dependence that plays a role within blocks but not between blocks.

  • Drawback:

It is less flexible than POT since in practice it might be hard to change the block size while it is easy in POT to move the threshold continuously.

Practical 2

Candlestick Plot

Stocks Gathered

Below is displayed the candlestick plot including the four stocks prices. Since Amazon has a much higher stock price (~1000$) than other stocks, it takes the most prominent size on the below plot compared to the others. You can use the zoom to see the detail of the three other stocks price evolution over time.

Stock Prices Separately

Log-Returns

For all four stocks, it seems that the mean of the log-returns is 0 or close to 0 and that they are stationary. AAPL and AMZN become more volatile in 2019 and their shapes seem to be skewed. NVDA and TSLA seem to be more volatile and their shapes seem to indicate they follow a normal distribution.

# Peak-Over-Threshold

We choose a threshold of 0.05 based on the mean residual plot. We plot the mean over and we use a peak over threshold model and display its estimate of the uncertainty

## [1] "uncertainty"
##    rlevel     shape 
## 0.1134840 0.2487138

We will now fit our stocks data of the upper tail beyond the chosen threshold of 0.05 to a GPD method.

## $threshold
## [1] 0.05
## 
## $nexc
## [1] 32
## 
## $conv
## [1] 0
## 
## $nllh
## [1] -74.3385
## 
## $mle
## [1] 0.01965532 0.60647468
## 
## $rate
## [1] 0.04238411
## 
## $se
## [1] 0.00558524 0.24858608

By using the POT approach, we built a model for high values of the negative log-returns where we obtain the Maximum Likelihood Estimates for the scale (sigma) and shape (ksi) parameters/coefficients : 0.01965532 and 0.60647468, with Standard Errors being 0.00558524 and 0.24858608.

Now we compute the 95% value at risk and we plot this value on a histogram

##          5% 
## -0.04455341

Practical 3

Informative Plots

Time Series

Since we do not see a trend, we understand that the time series are stationary (i.e. depend on the time of the observation).

From the plot below, we do see that many of the high loads and peaks were from searches done on members of the British Royal family such as Queen Victoria and Elizabeth II. These peaks are not linked to any seasonal trends and are instead due to independent events. For instance, the peaks increase massively as of 2016 in relation to the royal family members. One potential reason for this could be the arrival of the Netflix Tv Series “The Crown” which debuted in 2016 and details the lives of the royals and which is watched by many people, hence why searches could be so high. Moreover, with members such as Elizabeth II and Prince Philip being such important figures and rather elderly, there have often been health scares which could have also sparked high searches in regards to their well-being and updates. We can also see that the Summer Olympic searches are high around the time of it happening or building up to the event. However, once the Olympic games ended, the searches fall drastically, as to be expected.

Bar plot

We have made a bar plot depicting the frequency of views and searches on a given topic/type.

The bar plot shows the distribution of total daily count relative to the different time series. We see that the biggest frequency is generated by web searches of all of the different topics that we have. The black line is the median of the total daily count. The total count falls after this median line which indicates that the total daily count is not normally distributed but is right-skewed meaning that the tail of the distribution displays extreme values on the right hand side of the median.

Modified Series

\[ R_i = \frac{X_i - X_{i-7}}{X_{i-7}} * 100 \] \(R_i\) is the relative weekly change (increase or decrease) in percentage.

When plotting the relative weekly change we see the highs and lows more clearly. There is a massive high load in the traffic count April 2016 for Princess Margaret and another big but smaller peak for Winston Churchill in February 2016.

Peaks-Over-Threshold

For each time series (\(R_i,\ i = 1, ..., 12\)), we plot their daily count distribution on a Q-Q plot to assess if their data come from a theoretical distribution (e.g. Normal). Using the qq draws the correlation between a given sample and the normal distribution. The geom_line(…, distribution = stats::qqnorm) function visually checks the normality assumption of the distribution. If the assumption does not hold, we look into how/what data points contribute to the violation.

Results: on all below Q-Q plots, we observe that the daily count distributions follow a normal distribution when in the bulk of the distribution but departs from when in the tail indicating that the data is skewed. Indeed, it is right-skewed (i.e. positively skewed).

Knowing that the data is right-skewed we generate the time series’ Mean Residual Plot using the mrlplot() function from the extRemes package which plots the potential thresholds against the mean excess. The following plots are used to choose the most adequate thresholds. Indeed, the value of \(u\) from which the plot becomes approximately linear can generally be selected as the optimal threshold.

After setting their theoretical optimal threshold, we plot their Peaks-Over-Threshold Plot with their threshold \(u\) to display the exceedences. Note that if a threshold is too low then the extreme value approach cannot be valid and that if it is too high we cannot obtain insightful results because of too few data.

For the following modified time series where the POT is not suitable, we suggest using a Block Maxima approach.

Elizabeth II

Distribution

Mean Residual Plot

The selected threshold \(u\) is 125.

Peaks-Over-Threshold Plot

The chosen threshold indicates that there are not too few data points exceeding the threshold nor to many. Therefore, we assume that using the POT model is suitable.

GPD Fit - Diagnostic Plot

The Empirical Quantiles plot shows that the residuals of the modified daily count distributions does not really follow a gpd distribution since it does follow closely the regression line in the bulk of the distribution but move away from it at the end when in the tail. So the GPD model is probably appropriate to use for this time series. The Density plot allows us to see the overall shape of the residual distribution, here it a Bimodal distribution.

United States

Distribution

Mean Residual Plot

The selected threshold \(u\) is 35.

Peaks-Over-Threshold Plot

The chosen threshold indicates that there are few data points exceeding the threshold. Therefore, we believe that using the POT model is maybe not suitable.

GPD Fit - Diagnostic Plot

The Empirical Quantiles plot shows that the residuals of the modified daily count distributions does not really follow a gpd distribution since it does follow quite well the regression line in the bulk of the distribution but move away from it at the end when in the tail. So the GPD model is probably appropriate to use for this time series. The Density plot allows us to see the overall shape of the residual distribution, here it is a Bimodal distribution.

Queen Victoria

Distribution

Mean Residual Plot

The selected threshold \(u\) is 100.

Peaks-Over-Threshold Plot

The chosen threshold indicates that there are not too few nor too many data points exceeding the threshold. Therefore, we assume that using the POT model is suitable.

GPD Fit - Diagnostic Plot

The Empirical Quantiles plot shows that the residuals of the modified daily count distributions does not really follow a gpd distribution since it does follow closely the regression line in the bulk of the distribution but move away from it at the end when in the tail. So the GPD model is probably appropriate to use for this time series. The Density plot allows us to see the overall shape of the residual distribution, here it seems to be a Bimodal distribution.

World War II

Distribution

Mean Residual Plot

The selected threshold \(u\) is 30.

Peaks-Over-Threshold Plot

The chosen threshold indicates that there are not too few nor too many data points exceeding the threshold. Therefore, we assume that using the POT model is suitable.

GPD Fit - Diagnostic Plot

The Empirical Quantiles plot shows that the residuals of the modified daily count distributions does not really follow a gpd distribution since it does follow closely the regression line in the bulk of the distribution but move away from it at the end when in the tail. So the GPD model is probably appropriate to use for this time series. The Density plot allows us to see the overall shape of the distribution, here the data seems right-skewed.

World War I

Distribution

Mean Residual Plot

The selected threshold \(u\) is 25.

Peaks-Over-Threshold Plot

The chosen threshold indicates that there are not too few nor too many data points exceeding the threshold. Therefore, we assume that using the POT model is suitable.

GPD Fit - Diagnostic Plot

The Empirical Quantiles plot shows that the residuals of the modified daily count distributions does not really follow a gpd distribution since it does follow closely the regression line in the bulk of the distribution but move away from it at the end when in the tail. So the GPD model is probably appropriate to use for this time series. The Density plot allows us to see the overall shape of the Residual distribution, here the data seems right-skewed.

George VI

Distribution

Mean Residual Plot

The selected threshold \(u\) is 200.

Peaks-Over-Threshold Plot

The chosen threshold indicates that there are too few data points exceeding the threshold. Therefore, we believe that using the POT model is maybe not suitable.

GPD Fit - Diagnostic Plot

The Empirical Quantiles plot shows that the residuals of the modified daily count distributions does not really follow a gpd distribution since it does follow closely the regression line in the bulk of the distribution but move away from it at the end when in the tail. So the GPD model is probably appropriate to use for this time series. The Density plot allows us to see the overall shape of the Residual distribution, here it seems to be a Bimodal distribution.

United Kingdom

Distribution

Mean Residual Plot

The selected threshold \(u\) is 50.

Peaks-Over-Threshold Plot

The chosen threshold indicates that there are too few data points exceeding the threshold. Therefore, we believe that using the POT model is maybe not suitable.

GPD Fit - Diagnostic Plot

The Empirical Quantiles plot shows that the residuals of the modified daily count distributions does not really follow a gpd distribution since it does follow closely the regression line in the bulk of the distribution but move away from it at the end when in the tail. So the GPD model is probably appropriate to use for this time series. The Density plot allows us to see the overall shape of the Residual distribution, here the data seems right-skewed.

Princess Margaret, Countess of Snowdon

Distribution

Mean Residual Plot

The selected threshold \(u\) is 700.

Peaks-Over-Threshold Plot

The chosen threshold indicates that there are too few data points exceeding the threshold. Therefore, we believe that using the POT model is maybe not suitable.

GPD Fit - Diagnostic Plot

The Empirical Quantiles plot shows that the residuals of the modified daily count distributions does not really follow a gpd distribution since it does follow closely the regression line whether in the bulk of the distribution or at the end. So the GPD model is probably appropriate to use for this time series. The Density plot allows us to see the overall shape of the Residual distribution, here the data seems right-skewed.

Prince Philip, Duke of Edinburgh

Distribution

Mean Residual Plot

The selected threshold \(u\) is 400.

Peaks-Over-Threshold Plot

The chosen threshold indicates that there are too few data points exceeding the threshold. Therefore, we believe that using the POT model is maybe not suitable.

Winston Churchill

Distribution

Mean Residual Plot

The selected threshold \(u\) is 100.

Peaks-Over-Threshold Plot

The chosen threshold indicates that there are too few data points exceeding the threshold. Therefore, we believe that using the POT model is maybe not suitable.

Diana, Princess of Wales

Distribution

Mean Residual Plot

The selected threshold \(u\) is 70.

Peaks-Over-Threshold Plot

The chosen threshold indicates that there are not too few nor to many data points exceeding the threshold. Therefore, we assume that using the POT model is suitable.

GPD Fit - Diagnostic Plot

The Empirical Quantiles plot shows that the residuals of the modified daily count distributions does not really follow a gpd distribution since it does follow closely the regression line in the bulk of the distribution but move away from it at the end when in the tail. So the GPD model is probably appropriate to use for this time series. The Density plot allows us to see the overall shape of the Residual distribution, here the Residual distribution seems to be right-skewed.

2016 Summer Olympics

Distribution

Mean Residual Plot

The selected threshold \(u\) is 60.

Peaks-Over-Threshold Plot

The chosen threshold indicates that there are too few data points exceeding the threshold. Therefore, we believe that using the POT model is maybe not suitable.

GPD Fit - Diagnostic Plot

The Empirical Quantiles plot shows that the residuals of the modified daily count distributions does not really follow a gpd distribution since it does follow closely the regression line in the bulk of the distribution but move away from it at the end when in the tail. So the GPD model is probably appropriate to use for this time series. The Density plot allows us to see the overall shape of the Residual distribution, here the data seems right-skewed.

99-Quantile

The following outcome shows the 99%-quantile and uncertainty measures of the modified daily count modified for each time series. To generate these results we use the quantile function of the stats package and the standard error of the peaks-over-threshold model.

Detecting Simultaneous High Load

Graphical Method

The graphical method that we suggest for detecting simultaneous high traffic loads is a Block Maxima interactive plot colored by topic. This allows to visually see the daily maxima between the 12 different web pages. If for one or several days that display high maxima but also other value from other webpages that are just below, it indicates a simultaneous high load.

By pointing the mouse cursor over data points, more information is available. With this plot, we can observe that Princess Margret, Georges IV, and Prince Philip of Edinburgh have the largest simultaneous high loads around 2016 November 6, but also have another smaller simultaneous high loads in 2016 April 21. So the idea for Wikipedia is that if one of these web pages load is increasing they should do caching on the other ones to prevent exhausting of available resources.

Another alternative method would be to generate scatter plots to visualize how two distributions move together.

Numerical Method

For the numerical method, we compute a matrix of the tail dependence coefficients between the different web pages. We are interested in the values that are the closest to 1. They indicate that there is a high likelihood that if an extreme value in the tail arises for one web page, the other webpage is very likely to also display an extreme value in its high tail. Therefore, there is tail dependence and there is probably going to show a simultaneous high load for both (or more than two) pages. In this case, Wikipedia must anticipate and use the caching system on them.

Here, we use the tdc function form the FRAPO package.

2016_Summer_Olympics Diana,_Princess_of_Wales Elizabeth_II George_VI Prince_Philip,_Duke_of_Edinburgh Princess_Margaret,_Countess_of_Snowdon Queen_Victoria United_Kingdom United_States Winston_Churchill World_War_I World_War_II
2016_Summer_Olympics 1.000 0.071 0.000 0.071 0.036 0.107 0.036 0.107 0.214 0.036 0.000 0.036
Diana,_Princess_of_Wales 0.071 1.000 0.071 0.071 0.143 0.071 0.071 0.036 0.036 0.000 0.000 0.036
Elizabeth_II 0.000 0.071 1.000 0.464 0.607 0.464 0.286 0.107 0.000 0.036 0.143 0.000
George_VI 0.071 0.071 0.464 1.000 0.536 0.571 0.321 0.071 0.036 0.071 0.107 0.036
Prince_Philip,_Duke_of_Edinburgh 0.036 0.143 0.607 0.536 1.000 0.536 0.214 0.071 0.000 0.036 0.107 0.036
Princess_Margaret,_Countess_of_Snowdon 0.107 0.071 0.464 0.571 0.536 1.000 0.214 0.071 0.071 0.000 0.107 0.036
Queen_Victoria 0.036 0.071 0.286 0.321 0.214 0.214 1.000 0.036 0.000 0.000 0.107 0.000
United_Kingdom 0.107 0.036 0.107 0.071 0.071 0.071 0.036 1.000 0.036 0.179 0.107 0.107
United_States 0.214 0.036 0.000 0.036 0.000 0.071 0.000 0.036 1.000 0.071 0.179 0.179
Winston_Churchill 0.036 0.000 0.036 0.071 0.036 0.000 0.000 0.179 0.071 1.000 0.107 0.071
World_War_I 0.000 0.000 0.143 0.107 0.107 0.107 0.107 0.107 0.179 0.107 1.000 0.429
World_War_II 0.036 0.036 0.000 0.036 0.036 0.036 0.000 0.107 0.179 0.071 0.429 1.000